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Abstract 

We propose and study a class-expansion/innovation/loss model of genome evolution taking into account 
biological roles of genes and their constituent domains. In our model numbers of genes in different 
functional categories are coupled to each other. For example, an increase in the number of metabolic 
enzymes in a genome is usually accompanied by addition of new transcription factors regulating these 
enzymes. Such coupling can be thought of as a proportional "recipe" for genome composition of the 
type "a spoonful of sugar for each egg yolk" . The model jointly reproduces two known empirical laws: 
the distribution of family sizes and the nonlinear scaling of the number of genes in certain functional 
categories (e.g. transcription factors) with genome size. In addition, it allows us to derive a novel relation 
between the exponents characterizing these two scaling laws, establishing a direct quantitative connection 
between evolutionary and functional categories. It predicts that functional categories that grow faster- 
than-linearly with genome size to be characterized by flatter-than-average family size distributions. This 
relation is confirmed by our bioinformatics analysis of prokaryotic genomes. This proves that the joint 
quantitative trends of functional and evolutionary classes can be understood in terms of evolutionary 
growth with proportional recipes. 

1 Introduction 

Protein-coding genes in genomes can be classified in both functional categories (e.g. transcription factors 
or metabolic enzymes) as well as "evolutionary categories" or families of homologous genes (to avoid 
confusion, in the following we will reserve the term "category" to functional annotations, and we will use 
the term "family" as a generic indication of homology classes, or domain families/superfamilies in domain 
data, see Methods). Functional categories are routinely composed of a large number of evolutionary ones. 
This distinction is illustrated in Fig. [l] where genes are characterized by both shape (functional category) 
and color (homology class) with each shape represented by multiple colors. Understanding the principles 
connecting these separate classifications of genomic material is an important step in order to disentangle 
the organization of the content of whole genomes. 

More specifically, studies of fully sequenced genomes revealed that their functional and evolutionary 
composition is governed by simple quantitative laws |lj[2j. In particular, for prokaryotes the number of 
genes in individual functional categories was shown to scale as a power law of the total number of genes in 
the genome [2 . Depending on the functional category the exponent of this scaling law varies from (for 
fixed sets of housekeeping genes) to 1 (for metabolic enzymes) and all the way up to 2 (for transcription 
factors and kinases) [2|[3|. Furthermore, the distribution of sizes of gene families (called "evolutionary 
categories" in our title) has a scale-free distribution with the exponent inversely correlated with the 
genome size [l]P)p] ■ The overall number of gene (or domain) families represented by at least one member 
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exhibits a slower-than-linear scaling with the total number of genes in a genome [6|[7]. Biologically, 
the growth of evolutionary families derives from combined processes of horizontal gene transfer, gene 
duplication, gene genesis, and gene loss |8 . For prokaryotes, horizontal transfer appears to dominate 
gene family expansion [oj, and the same process is presumably very important for the introduction of a 
new evolutionary family into an extant genome. 

The comprehension of these empirical laws requires to construct quantitative models that explore 
different design principles, or more prosaically the recipes by which genomes are built from elementary 
functional and evolutionary ingredients. In this study we introduce the first model to jointly explain 
observed scaling laws for evolutionary families and functional categories. 

Several theoretical models have been previously proposed to explain the observed power-law distribu- 
tion of family sizes [5 10 ■ 13 Most of these models are of class-expansion/innovation/loss type, abstractly 



mimicking basic evolutionary moves such as horizontal transfer, duplication, loss. We recently formulated 
a related model that in addition to family size distribution also explains and successfully fits the scaling 
of the number of distinct gene families represented in a genome as a function of genome size [6 



14 



On another front, the "toolbox model" of evolution of metabolic networks and their regulation re- 



cently proposed by one of us 15 offered an explanation for the quadratic scaling between the number of 
transcription factors and the total number of genes in prokaryotes. In this model, metabolic and regula- 
tory networks of prokaryotes are shaped by addition of co-regulated metabolic pathways. The number of 
added enzymes systematically decreases with the proportion to which the organism has already explored 
the universe of available metabolic reactions, and thus, indirectly, with the size of its genome. For the 
purposes of the present study, a key ingredient of the toolbox model is that events adding or deleting 
genes in multiple functional categories (in this case metabolic enzymes and transcription factors regulat- 
ing metabolic pathways) are tightly correlated with each other. The concept of coordinated expansion or 
contraction of functional categories can in principle be extended beyond enzymes and their regulators. 

One should note that this explanation of scaling of functional categories is conceptually different from 
that based on "evolutionary potentials" proposed in Ref. [sl . Evolutionary potentials quantify the intrinsic 
growth rates of individual categories. This means that in this model the growth of one functional category 
is represented as uncoupled from growth or decline in other functional categories. However, evolutionary 
potentials could also be the effective result of the coordinated expansion of multiple functional categories 
linked by interactions of biological and evolutionary origin (e.g. linking membrane proteins with signal 
transduction, etc.) On the other hand, it is clear that models with evolutionary potentials represent quite 
well the empirical data on the growth of functional categories, and thus it appears that this must be (at 
least) a very good effective description, that any more detailed model needs to reproduce. 

This study brings together the basic ingredients of class-expansion/innovation/loss models [6|[l4] 



and coordinated growth of functional categories 15 . The resulting combination allows us to study the 



interplay between the scaling of evolutionary and functional categories. In particular, we mathematically 
derive a relation between the exponents characterizing these two scaling laws. It predicts that functional 
categories that grow faster-than-linearly with genome size are characterized by flatter-than-average family 
size distributions. This prediction of our model is subsequently verified by our analysis of functional 
and evolutionary scaling in empirical data on sequenced prokaryotic genomes. Finally, we analyze and 
discuss the alternative combination of a class-expansion/innovation/loss model with growth of functional 
categories dictated by evolutionary potentials. 

2 MATERIALS AND METHODS 

Models 



The model represents a genome as a list of genes, which is partitioned in homology families and functional 
categories. Genome evolution is modeled as a stochastic process where the elementary moves can be 
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Figure 1. Scaling laws in joint functional/evolutionary partitioning of genomes. Genomes are 
partitioned into families of homologous genes (colors) and functional categories (shapes) . (A) The 
number of unique evolutionary categories (domain families) (y-axis) scales sub-linearly with the genome 
size (x-axis.) (B) Cumulative histograms of domain family size (see Figure |4]). (C) The number of 
transcriptional regulators (red), metabolic enzymes (blue), and housekeeping genes responsible for 
translation (green) plotted as a function of the genome size measured by the total number of domains. 
Symbols in all the plots are empirical data for protein domains in 753 fully sequenced bacterial genomes. 



any of two types: (i) a "family expansion" or "duplication" move in which a new domain is placed 
in an evolutionary category (family of homologous domains) already present in the genome or (ii) an 
"innovation" move in which a new family with just one domain appears in a genome (e.g. by the virtue 
of horizontal gene transfer). 

We would like to emphasize that in the tradition established in "duplication-innovation-loss" models, 
which we follow, the family expansion move is customarily labeled as duplication. In reality this move 
can come either by the virtue of gene duplication or by horizontal gene transfer, which appears to be 
the dominant class-expansion mechanism in bacteria [o] . The overall family size in all genomes might be 
generating an effective "preferential attachment" for HGT events (see Refs. [3 16 and open comments 
by referees therein). 

Although genes are natural objects of this kind of description, it is not simple to use genes as central 
units in the analysis of empirical data, mainly due to the fact that gene dynamics is complex and may 
contain events of gene fusion, splitting and internal rearrangements. Thus, as in some previous analyses, 
we will compare the models with data on protein domains Islro], which have the important property that 
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Figure 2. The number of evolutionary (domain) families belonging to a functional category follows a 
linear law in empirical data, denoting a possible invariant of genome composition. The left panel plots 
the data on for the number of families fc in the ten largest functional categories on all genomes, 
following the trend fc — Ac + Xcf, where / is the total number of families on the genome. Symbols are 
empirical data for 753 fully sequenced bacterial genomes. The offset Ac is large only for the 
"translation" category. The right panel is a plot of the coefficients Xc obtained from the same data 
(subtracting the offset Ac obtained from a linear fit), as a function of genome size in domains, n. See 
also Supplementary Table SI. 



they cannot be split into smaller units 17 . Domains are modular building blocks of proteins and it has 
been argued that they effectively work as the natural atomic elements in genome evolution [4] . Concerning 
the scaling laws, domains appear to have the same qualitative behaviour as genes. Throughout the paper, 
we will be comparing the models with data on 753 bacteria from the SUPERFAMILY database 18 . The 



models will be formulated for abstract atomic elements that could be genes or domains, and possible 
relevant issues when dealing empirically with genes will be addressed in the discussion. In describing the 
models we will generally refer to these units as genes. 

Technically, in order to compare with the protein domain data we rely on simplifying assumptions 
on the domain composition of proteins. Obviously the situation is more complex than this. We have 
verified in the data that the number of TP domains are linear in the number of TF genes (Supplementary 
Figure S5|, with slope 1.09 (average number of TF domains in a TF gene). A second assumption is that 
the number of families belonging to a functional category is linear in the total number of families. This 
assumption is in accordance with data (see Figure [2] and Supplementary Figure SI). In particular, we 
observed this trend for the number of transcription factor superfamilies (see Supplementary Figure S2 ) . 



Standard Chinese Restaurant Process. 

The starting point is a class-expansion/innovation process for the homology families that reproduces 
qualitatively the empirical scaling laws |6j. This process (known in mathematical literature as "Chinese 
Restaurant Process" or CRP [l9]) defines a growth dynamics for the partitioning of a set of elements 
(genes or domains) based on two basic growth moves. Traditionally the CRP model is defined by two 
parameters a and constrained by < a < 1 and > —a. The moves are quantified and defined by two 
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probabilities po and pat of duplication and innovation respectively. 

• The class-expansion probability Pq of a domain family i is proportional to the number of family 
members currently in the genome offset by a: Pq ^ rii — a (see Table [l]). 

• The innovation probability Pat is the probability of adding a new domain family with one member. 
It corresponds to a new domain family appearing in a genome by de novo evolution or horizontal 
gene transfer. The CRP model assumes pm ^ aj + 0, where / is the total number of domain 
families present in the genome. 

The normalization condition p^ + X]i-Pb ~ ^ determines the pre-factor in both equations to be l/{n-\-6). 
A gene loss move does not seem to be essential for the basic qualitative results. Indeed, if stochastic 
(uniform) gene loss is incorporated into the model it results only in renormalization of parameters po 
and Pat [if] ■ 

We explore the model by direct simulation and by solving continuous "mean-field" equations [6 14 



that describe the mean behaviour of the number of homology families and functional categories, and the 
statistics of the population of families and categories. 

CRP model incorporating functional categories. 

In order to introduce functional categories into the CRP, one has to specify po and pj^ for different 
categories. We first assume that the probability of introducing a gene of a specific functional category by 
the innovation move is independent of genome size. This assumption implies that the number of homology 
families of a given category scales linearly in the total number of families, and is justified empirically 
for some functional categories by domain data (see Figure [2] and Supplementary Figures SI and S2 ) 



Equivalently, = XcPn, where Xc is the probability of introducing a new family of the category c. In 
other words, it is assumed here that every time a new family is added, the probability that it will belong 
to category c is Xc- 

Under this assumption, the mean-field equation describing the growth of a family of homologous 
domains (evolutionary category) is 

/ 

C{n)dnni ^'^OijUj - a. (1) 

Here the genome size n is used instead of time and averages over multiple realizations of a process are 
implied. The novel ingredient of the model - coordinated growth of functional categories - is encoded 
in the coefficients a^- responsible for correlated duplications between evolutionary families i and j. We 
assume a^j to depend only on functional roles of families i and j. The equation describing the growth of 
/ - the number of distinct families in a genome is the same as in a standard CRP model. 

C{n)d^f = {af + 9) . (2) 

The function C(n), which sets a natural time scale for the process, is determined by the normalization 
condition (9„ri = 1, i.e. J^i dnTii + dnf — 1. 

For the specific case of categories of transcription factors (TFs) regulating metabolic processes and 
their metabolic target enzymes, the necessity of a correlated move can be argued along the lines of 



Ref. 15 . A set of new targets has to be added to incorporate a new metabolic function. This entails 
the addition of a new metabolic pathway that is long enough to connect a new nutrient to a previously 
existing pathway, that further converts it to a central metabolic "core network" . Supposing that each 
newly added branch is controlled by only one added transcription factor, since the length of the branch 
becomes smaller with increasing size of the organismal metabolic network (compared to a metabolic 
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"universe"), on average, increasingly more TFs per target will be needed in order to control newly 
incorporated branches. 

More generally, functional, genetic and epistatic interactions can create the correlated growth of 
different functional categories of genes. In the discussion section we provide the empirical evidence of 
statistically significant correlations between various functional categories. 

Following the recipe outlined in Ref. 15 we consider a simplified version of the model involving 
only two functional categories: 1) TF- transcription factors controlling metabolic processes; 2) met - 
metabolic enzymes they regulated. As in the toolbox model, changes in utf and n^et ar^ coordinated 
with correlation coefficients Uij given by 

= , aji = 0; tor I ^ J 

and an — . 

Here U is the size of the metabolic universe, i denotes any gene family from functional category TF, and 
j - from the functional category met. In this variant, addition of transcription factors can only occur 
conditionally to the addition of metabolic enzymes. In the following, we will refer to this model variant 
as model la. We define a second variant of the correlated model (model lb), which is a more direct 
extension of the standard CRP model, and thus can exploit previous mean-field theory analytical results. 
In this case 

fljj = , aji = 0; for i ^ j 

and an — 1 , 

(where i again denotes any gene family from functional category TF and j - from the functional category 
met). In this model variant, all families (and hence also transcription factors families) have an equal 
intrinsic growth rate on top of the correlation. If aij =0, i j the model is equivalent to the standard 
CRP. Finally, we also considered a model (model II) where correlations between functional categories are 
absent, but instead members of a given functional category are added at a category-dependent intrinsic 
rate as prescribed by "evolutionary potentials" of Molina and van Nimwegen (in this case, aij = for 
i j, and an = Pc(i)j where c{i) is the functional category to which family i belongs, and is the 
evolutionary potential of class c) . These results are discussed later on in the manuscript and compared 
to to the two "correlated duplication" models above (see Discussion and Supplementary Text). 

To resume, two kinds of models are considered here: "correlated recipes" , where the scaling exponents 
can only result from interactions between categories (model la and lb, the main focus of our study), and 
"absolute recipes" (model II), leading to different intrinsic growth rates for different categories. Correlated 
models might contain an specific intrinsic growth rate of the classes, equal for all classes (model lb), or 
not (model la). We will see that the important distinction between model I (a and b) and model II is 
that the different scaling exponents for functional categories are a result of correlations and not absolute 
class expansion rates. 

Data 

Data on superfamily domain assignments and superfamily functional annotations for the 753 Bacteria 
were obtained from the SUPERFAMILY (vl.73) database [18] . The database contains 1291 different 
domain superfamilies grouped into 47 different functional categories (60 families do not belong to a 
specific category). These categories are divided into 6 larger groups (Metabolism, General, Regulation, 
Information, Initiation Complex Processes and Elongation Complex Processes, see also 
|http://supfam.cs.bris.ac.uk/SUPERFAMILY_1.73/function) . 
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Evaluation of exponents in empirical data 

We considered the normalized cumulative histograms (families with more than d members) and non- 
cumulative histograms (families with exactly d members) of the populations for all evolutionary families 
(related to exponent f3, see Results), and those restricted to the families belonging to each of the main 
functional categories indexed by c (related to the exponent /3c, see Results). Exponents were estimated 
by fitting the data with a power-law, restricting to a window where the x axis value was less than a 
cutoff value, as in Ref. 1^. The cutoff was chosen for each fit, by minimizing the chi square residuals 
with varying window size. This procedure was implemented with a custom CINT (CH — h) script using 
the ROOT software. Figure [6] is obtained considering the fitted exponents for the histograms of the five 
largest genomes (where the "finite-size correction" is smallest, see Figure [s] and Ref. [l4].) 



Empirical correlations among functional categories 

Correlation between families (or categories) populations were calculated from the deviations from the 
average trend. We obtained the frequency of a family /category in every genome, defined as the ratio 
between the population of a family in domains and the total number of domains assigned on that genome. 
Subsequently for every family /category, we extracted an average trend as a function of genome size n 
using a sliding- window histogram (with window size of 280 domains and resolution of 28 domains) , and we 
considered the deviation of each genome from the average trend at its value of n. The Pearson correlation 
of these deviations over all the genomes was considered between each pair of families/categories (Figure 
[Tjand Supplementary Table S3 and S4). 



Models and simulations 

The quantitative duplication-innovation evolutionary models were explored by a mean-field analytical 
approach and direct numerical simulations. The mean-field approach considers equations for the means 
of the observed quantities in the large-n approximation. In parallel with the mean-field analysis, we 
performed simulations of the main model and its variants. The realizations depend on the following 
parameters, (i) The parameters of a standard CRP, a and 9. (ii) The parameter i-G- the probability 
that a new family belongs to a given functional category. This parameter can be inferred from data (see 
Results and Figure [2]). For example, for the case of transcription factors and targets, we defined xtf 



from the slope extrapolated from Supplementary Figure S2 giving xtf — 0.035 (see also Supplementary 



Figure S6). (iii) Initial conditions, represented by initial configuration (number of leaves, number of TFs 
and number of families in both categories). We have used the configuration of the smallest bacterium in 
the dataset (Candidatus Carsonella ruddii). An alternative choice could be the minimal intersection of 
all genomes in the database, (iv) Variant-specific parameters, that amount to the evolutionary potentials 
Pc for the first variant of the model, and the correlation matrix between functional categories, aij for 
the second variant. Simulation results are typically visualized in boxplots in order to compare the means 
with the probability distributions. In these plots bars correspond to (in order) the smallest observation, 
lower quartile, median, upper quartile, and largest observation. 



3 RESULTS 

A new invariant of genome composition 

We found (Figure [2] and Supplementary Table See also Supplementary Table SI) that the number of 
evolutionary domain families forming a functional category follows a linear law in empirical data, denoting 
a possible invariant of genome composition. This also implies that the mean law dnfc = Xcdnf assumed 
in the model is justified by the data. This does not mean exactly that the fraction of all families belonging 
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to a certain functional category is constant. Rather, the observed law can be fc = Ac + Xcf, with an offset 
Ac representing a minimal amount of evolutionary families required to build a given functional category. 
In empirical data, this offset appears to be large only for the "translation" functional category. 

The model captures the combined scaling laws 

Numerical simulation and mean-field analytical solutions of the correlated growth model (model I) re- 
produce very well both the empirical behavior of the TFs scaling law and the statistics for evolutionary 
domain families (Figure [3] and Supplementaty Figure S4). We found no significant qualitative differ- 



ence between models la and lb regarding these observables. Furthermore, the joint scaling laws can be 
reproduced also with an uncorrelated model (model II), with minor technical difficulties (see Discus- 
sion). The correct asymptotic quadratic scaling can be obtained from mean-field arguments for both 
model I and II. These arguments are presented in the Supplementary Text. In order to illustrate this 
point we consider for example model lb. Starting from Eqs. [l] one has to sum over all domain families 
from functional categories TF and met. Since utf — J2ieTF depends on the number of TF classes, 
one must have for its derivative dnfiTF = J2ieTF ^n^i + dnfxF- Combined, these two equations give 
driTF / drimet — "^iriTF — (y)/{nmet — a) ~ 2nTF/nmet, or finally the quadratic scaling titf ^ "^Imet- 

Altogether, the agreement between data and model is universal, in the sense that the same three 
parameters are sufficient to predict family/category numbers and populations for all genomes in the 
dataset. Moreover, the comparison does not rely on the adjustment of any hidden parameter. It is also 
worthwhile noting that, while the input of model I (a and b) is built to give an asymptotic power-law 
scaling exponent of two for transcription factors (which is reproduced by the mean-field approach), at the 
relevant genome sizes the model automatically reproduces the correct empirical exponent (about 1.6 in 
the SUPERFAMILY data) as an effect of the finite system size. Note that in model lb TFs can duplicate 
both spontaneously (uncorrelated move) and following spontaneous duplication of targets (correlated 
move) , corresponding to the terms an and in equation [ij while in model la this does not happen. 

The extension of the model to more than two categories requires to know the laws through which 
families of different categories are correlated with each other. Supplementary Figure |S3| compares the 
results obtained by a correlated duplication model formulated with three categories (TFs, met, others). 



Table 1. Basic model quantities and notations 



Quantity 


Meaning 


a , e 


CRP model parameters 


n 


Genome size quantified by its total number of domains 


n. 


Number of domains in tlie family i 




Number of domains in the functional category c 


/(") 


Number of families in a genome of size n 


Un) 


Number of families in a genome of size n belonging to the functional category c 


f{d,n) 


Number of families with exactly d members in a genome of size n 


fc{d, n) 


Number of families belonging to the functional category c with exactly d members in a genome of size n 


13 


Exponent of the family-population histogram 


/9c 


Exponent of the family-population histogram restricted to category c 


Xc 


Probability to introduce a new family of the category c (empirically quantified by the slope of vs. /) 


Cc 


Exponent of the scaling of the size of functional category c vs. genome size n 
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Figure 3. Comparison between 1000 realizations of the correlated duplication model lb at a = 0.3 and 
6 = 140 (blue boxplot) with empirical data (red). The left panel is a plot of the number of distinct 
domain families versus genome size. The fact that the number of families does not saturate is a 
property of the standard duplication- innovation model (see [6] for a complete discussion). The right 
panel plots the number of TF domains versus the total number of domains, showing that the scaling of 
the transcription functional category is well reproduced (exponent ~ 1.6.) See Supplementary Figure S4 
for model la. 



Prediction of the exponents of the family-population histogram restricted to 
single functional categories. 

While the agreement between model and data shows that the scaling of functional and evolutionary 
categories can be understood jointly, it does not provide by itself any substantially new information 
about how the two partitionings interact. Further insight can be obtained considering the distributions 
of the number of domains per family for different evolutionary families belonging to the same functional 
category. In general, the population of domain families of a genome follows a near power law distribution 
whose slope depends on genome size (Figure |4]). The mean number f(d,n) of domain families having 
d members at large genome size n is well described by the slope (see Figure |4]) , and thus the 

cumulative histogram by Q{d,n) ^ 1/d^ , where the fitted exponent /? typically lies between and 1. 
The standard CRP predicts this behavior (6|[l4]. The model described here here allows to consider the 
same histograms restricted to specific functional categories (Figure |4] and Figures [6]). 

A mean-field calculation (see Supplementary Text) based on the model variant with correlated dupli- 
cation predicts that the different trend of domain population histograms for transcription-factor families 
scales as f{d,n)TF ~ 1/fi^+T (see Figure [5|. Thus, the ratio between the exponent of the cumulative 
histogram of all families and the exponent of the cumulative histogram restricted to families belonging to 
the transcription factor category is predicted to be equal to the mean-field exponent for the scaling of the 
functional category. Specifically, Q{d,n) scales as 1/d^ whereas QTF{d,n) scales as d"^/^ and thus the 
ratio of exponents is /3/{/3/2) — 2, and this matches the asymptotic scaling of the number of transcription 
factors. More in general, the model indicates that each time the per-family duplication probability for a 
functional category takes the form Pq ~ Cc^c where ric is the total population of the functional category c, 
the coefficient will appear in the equation for P{d)c, the (cumulative) distribution of families belonging 
category c. This causes the relationship /3c = /3/Cc and appears to be robust with respect to the choice of 
a specific model (see Supplementary Text). In other words, a precise quantitative relationship must exist 
between the scaling exponent of a category and the slope of the family population histogram restricted 
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to the same category. Functional categories that grow faster-than-Unearly with genome size wiU have 
flatter-than-average domain family size distributions. Conversely categories growing slower-than-linearly 
will follow a steeper-than-average slope. 

Accordingly, a strongly visible trend should be expected in empirical data from families belonging 
to the transcription factor category, which scales with exponent 2. Indeed, the empirical population 
histograms for the transcription factor functional category for all the genomes in the data set have 
a slope that is spectacularly different from the global one (Figure [s] and Supplementary Figure S13). 
Quantitatively, this observation is in excellent agreement with predictions (Table [2]). Direct simulations of 
the correlated model reproduce well both the behavior of the histograms at given size and the dependency 
on genome size (see Figure |4]) . 

More generally, one can test the prediction Cc = P/ Pc with an empirical evaluation of many functional 
categories (Figure [6]). The agreement of empirical data with the predicted behavior is reasonably good, 
keeping in mind that many functional categories are composed by few or poorly populated families, and 
in these cases the data might not follow a scaling law that is as clearly defined as the metabolism or the 
transcription factor categories. 



4 DISCUSSION 

Population of evolutionary families of a given functional category 

We have presented the first combined quantitative description of the partitioning of genomes in both 
evolutionary families and functional categories. The results show that a theoretical framework that 
correctly reproduces both the scaling laws for functional categories of genes/domains and the scaling 
laws for gene/domain families (numbers and histograms) is possible. Biologically, this finding can help 
us understand the large-scale architecture of a genome in terms of its functional content. 

Analyzing the data in order to formulate the model, we found that the number of evolutionary 
domain families forming a functional category is linear in the total number of domain families (Figure [2]). 
Thus, the genomic subdivision of evolutionary classes in functional categories appears to be arguably the 
simplest possible, if one disregards the class population. This ingredient was taken as an assumption for 
all the models considered here, which the data fully justify. 

The model leads to the nontrivial prediction that connects the growth exponent of a functional 
category to the slope of the population family histogram restricted to the same category. In other 
words, the populations functional categories and evolutionary families of genes are connected by a simple 
quantitative law. Specifically, the ratio between the exponent of the cumulative histogram of all families 



Table 2. Prediction of the exponent of the family-population histograms restricted to singular 
functional category. Comparison between expected and observed ratio of the exponent of the 
cumulative histogram of all families and the exponent of the cumulative histogram of 
transcription- factor families ( Figure [6]), for the five largest bacteria in the SUPERFAMILY database. 
The ratio can be compared with the mean-field prediction of 2, or directly with the empirical exponent 
of the transcription factor functional category (1.6). 



Genome 




Ctf 


Sorangium ccUulosum 


1.72 ±0.1 


1.6 


Burkholderia xenovorans 


1.63 ±0.08 


1.6 


Burkholderia 


1.54 ±0.13 


1.6 


Solibacter usitatus 


1.46 ±0.05 


1.6 


Bradyrhizobium japonicum 


1.59 ±0.11 


1.6 
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Figure 4. Empirical data and simulations for the normalized domain family population cumulative 
histograms. The histograms are defined as the fraction f{d, n)/f{n) of families with more than d 
domains. (A) Empirical data for the 753 bacteria in the SUPERFAMILY database (each color is a set 
of genomes with similar sizes). Left panel: domain family population cumulative normalized 
histograms. Right panel: normalized cumulative histograms restricted to domain families belonging to 
the transcription factor functional categories. Note that the histograms slopes are different. (B) 
Simulations for domains family population cumulative histograms of CRP with correlated duplications 
run at a = 0.3 and 9 = 140. The plots in the two panels are defined as in (A). (C) Comparison between 
simulations of the correlated duplication model variant run at a = 0.3 and 9 = 140 (black lines) with 
empirical data (orange lines) for the largest genome sizes (5000 < n < 8500). Left panel: global 
normalized cumulative histograms of domain family population. Right panel: normalized cumulative 
histograms restricted to transcription factor domain families. 




Figure 5. Exponent of evolutionary families and genome size. Fitted exponent of domain family 
population cumulative histograms vs. genome size, for the 753 bacteria in the SUPERFAMILY 
database for TF families (red circles) and all families (black squares), obtained by a fitting method 
giving a lower weight to the tail in order to keep into account the cutoffs (used in Ref . 14 ) . 



and the exponent of the cumulative histogram restricted to families belonging to a functional category is 
predicted to be equal to the exponent for the scaling of the functional category. 

To generate this prediction, we have analyzed in detail the case of transcription factors, where the 
exponent of the population histogram is halved due to the quadratic scaling using mean-field calculations 
and simulations, and verified that it holds in general by simulations of both model variants. Empirical data 
on transcription factors follow this behavior remarkably well, showing population cumulative histograms 
of transcription factor superfamilies decaying with halved exponents compared to the global populations. 
The fatter tails of the TF histograms might also be related to the fact that only a few highly populated 
DNA-binding domain superfamilies dominate the population of TF DNA-binding domains and determine 
the scaling laws (Supplementary Text and Supplementary Figures SIO and Sll). More in general, we 
have also compared the behavior of domain family population histograms for all the empirical functional 
categories with the prediction, obtaining results that are in good agreement (Figure[6]), in particular for the 
highly populated categories, where the fitting procedure gives the highest confidence. The only highly- 
populated category that significantly violates this general trend is small molecule binding, a category 
composed of very few highly-populated domain families. This category is known to follow peculiar 
evolutionary laws, with high mobility of domains across the metabolic network, resulting in members of 
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Figure 6. Linear relation between Cc and 1/ (3c- Our theory predicts ~ P/ Pc (solid line). The 
empirical value of /3 = .74 is calculated from the family population histograms of the five most 
populated genomes. Symbols (circles and triangles) are empirical data for 38 functional categories (see 
also Supplementary Table S2). Triangles represent the ten most populated categories, where the 
estimated exponents are most accurate. The outlier is the "small molecule binding" category known to 
follow peculiar evolutionary mechanisms |20 



the same family being scattered across different pathways and producing lineage-specific domain families, 
with frequent re-invention of the same function by different families 20 21 . Thus, the exception makes 



biological sense, and can be understood in terms of members of evolutionary classes "jumping" to different 
functional categories with high rate during evolution. 



Correlated and absolute recipes 

The central ingredient of our main model (model I) is the coupling between addition/removal of genes 
in different functional categories. From a biological standpoint, it is reasonable that gene repertoires 
of functional categories related to each other via shared tasks, pathways or processes should follow 
coordinated rules [s]. In order to further justify this assumption, we probed directly the empirical 
domain data for correlation between number of domains in different functional categories,. To this end, 
for each genome g we calculated the deviation Sndg) between the functional category size {nc{g) and its 
average size in genomes of comparable size (see Methods) . We then calculated the matrix of correlations 
between values of Sric for different functional categories c. The results are reported in Figure [7] and 
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Supplementary Tables S3 and S4. We also tested that this procedure for evaluating the correlation was 



not dependent on genome size (Supplementary Figure S9 ) The metabolism categories appear to be 



highly (anti-)correlated with each other, probably because of the role they play in different pathways of 
a common metabolic network [15| . The observed correlations between metabolic families might also be 
relevant for reproducing the correct tail of the family population histogram restricted to the metabolism 
category (Supplementary Figure S3 1. 

An alternative approach is a description where the growth of each category is governed by intrinsic 
"evolutionary potentials" [s]. We have also analyzed such a description in some detail (see Supple- 
mentary text and Supplementary Figure S3). Despite of minor differences, a model combining class- 
expansion/duplication/loss with uncorrelated moves for the functional categories, model II, can also 
perform well in reproducing the joint scaling law and in predicting a relationship between the scaling ex- 
ponents and the functional categories. In particular, this means that the latter result should not by itself 
be considered a piece of evidence in favor of a correlated recipe. Figure [8] illustrates the basic differences 
between the two descriptions. The evolutionary potentials approach generically requires a lower number 
of parameters, but suffers from the tedious technical problem that the values of the growth coefficients 
cannot be controlled directly, because of the scaling of the normalization constant with genome size (see 
Supplementary Text and Supplementary Figure S7). The correlated model is technically more under 
control, since its behavior does not rely on any unknown normalization constant. For this reason, it 
also performs better with functional categories that grow faster than linear with genome size, such as 
transcription factors. On the other hand, such a model can be formulated with very few parameters only 
when a synthetic description for the correlations, such as the toolbox model, is provided. 

Here, we have considered mainly a model with three categories (transcription factors, metabolic, and 
others) and one nonzero correlation between metabolic domains and transcription factors. In general, 
specific biological details of how categories are correlated with each other determine the scaling exponents 
relating their genome fractions to each other and genome size. Note that the task of formulating a 
correlated model for many categories requires a knowledge of how the different functional categories are 
"slaved" to each other. This structure is largely unknown quantitatively, and can in principle define 
an arbitrarily complex network of interactions, since many categories may correlate with many others 
in potentially complicated ways. Should the importance of correlated recipes be confirmed by further 
analysis, it seems likely that the full formulation of such a description would still require to solve this 
problem. In order to show explicitly that the model can in principle be successfully extended to many 
categories (and still give scaling laws) we have analyzed the case of a simple hierarchical structure where 
many categories are slaved to a main one (see Supplementary Figure S8 ) . 

Overall, since functional categories scaling laws effectively emerge from the correlated approach, a 
good reconciliation of the two approaches could be to interpret the evolutionary potential model as an 
emergent description (which can be very useful in concrete empirical applications). In other words, 
evolutionary potentials would describe emergent effective growth of functional categories of a genome, 
averaging over more "microscopic" evolutionary processes where addition of genes belonging to specific 
functional categories needs to comply to constraints combining different functions to perform specific 
cell tasks. These kind of interactions between functions are better described by correlated growth of 
functional categories. In this view, genome growth would be governed by a relative recipe, where the 
proportions are more important than the exact amounts, rather than an absolute recipe, where only the 
detailed amounts of each ingredient play a role. 
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Figure 7. Correlation between the populations of 24 different metabolic functional categories from the 
SUPERFAMILY database for 753 bacteria. The correlation matrix is calculated from fluctuations of 
categories from the average trend (see Methods). Both correlation and anticorrelation are present 
between categories. Different metabolism categories are highly (anti-)correlated. 
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Figure 8. Models with correlated versus absolute moves. Top: the Chinese restaurant process (CRP) 
acts on the homology families (colors) with a duplication and an innovation move. It is extended here 
to include functional categories (shapes) Middle: model with evolutionary potentials. Functional 
categories are assigned differential duplication rates as in ref [Sj. Bottom: Model with correlated moves. 
Members of the functional categories are added proportionally between correlated pairs of functions 
(e.g. transcription factors and metabolic targets) as in Ref. p^. 
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Supplementary Text and Figures for Grilli et al. 



SI Description of the model and basic mean- field results 

This section discusses in more detail the analytical derivation of the scaling for the main observables of 
model I and II using a mean-field approach. 

Consider a joint partioning of elementary units (domains or genes) in functional and evolutionary 
categories, as illustrated in Figure [l] of the main text. The elementary units (in our case domains), 
belong to a single evolutionary family i, and every family i belongs to one and only one functional 
category c. 

The generic stochastic growth model considered here defines how new units are introduced into the 
system. The model is specified by a set of basic rates. The basic set of rates is constitued by the 
probabilities pi that a newly added unit belongs to a certain class i. More in detail, we define a probability 
Pq (where O stands for "old") that a new domain belongs to a family i which is already present in the 
system (i.e. having at least one member) and the probability pM (where TV stands for "new") that the 
added unit belongs to a family which is not already present in the system. 

The choice oi Pq and pm defines the model as a stochastic process for the basic observables (such as 
genome size n, family number / and its population rii, etc.), but one extra detail is needed. When a 
new class is introduced, the model needs to specify the category it belongs to. As discussed in the main 
text, in the model considered here a newly added family always belongs to a category c with probability 
Xc- The probabilities p^, pn and Xc can depend, in principle, on the number of units n and on their 
distribution in families, on the total number of families / and so on. Empirical data indicate (see Figure [2] 
in the main text) that Xc is a category-dependent constant, and thus does not depend on n. 

The mean-field approximation is useful to extract the basic information from the model [6j. In each 
realization of the full stochastic process, the probabilities of the possible configurations at time t+1 are 
determined by the configuration at time t. The mean-field approximation assumes that the configuration 
at time t is the average configuration. For example, if one is interested in the number of domains belonging 
to family i, the average number of elements 71^(^ + 1) at time i 4- 1 will be equal to the average number 
of elements riiit) at time t summed with the average number of elements added in a time step, i.e. Pq. 
For asymptotically large t this implies the approximate equation dtUi = Pq for the averages (here the 
averaging procedure is implicit in the notation). Since typically, at each step one and only one element 
is added, the mean number of elements \s n = t. If this is not the case, we can obtain 9„ni simply from 
dtUi divided by dtu. Considering n = t we obtain, for a generic model, the following mean-field equations 



dnfli = Po 

dnf = Pn 

dnfc = XcPN 

dnUc = 9n^ni dnU^ + dnfc ^^Ph + XcPN ■ 



(SI) 




Sl.l 



Models with correlations 



We now deal with the scaling of the basic observables in the model taking into account the correlation 
between categories growth (model I of the main text). 
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The correlation appears in the growth of the domain famihes of different categories. Thus the proba- 
bihty that a domain is added to a given family i can be written as 



The coordinated growth of functional categories is encoded by the coefficients a^ j , responsible for the 
correlated expansion of evolutionary families i and j (See Equation [l] of the main text). The standard 
Chinese Restaurant Process (CRP) is obtained by imposing a.^j = Sij (where Stj is equal to 1 if i = j 
and otherwise) . We assume that these coefficients depend only on the functional categories c and c' to 
which the families i and j belong. The probability of introducing a new domain is given by 

PN = —7 . (S3) 

SI. 1.1 Model la. 

We consider a model inspired by ref. [15) (the toolbox model, in which the growth of the number of 
transcription factors is coupled to the number of added metabolic enzymes), extended to describe a joint 
partitioning in functional and evolutionary categories. In the original version of the model the average 
increment of the main observables at each time step is 

- JL- 

nmet (S4) 

AriTF = 1 , 

and thus AnTp/Anmet = nmet/U , which gives a quadratic scaling for titf with Umet- 



Model la is an extension of the toolbox model is formulated following equation S2 by using a proper 
definition of ajj, such as the same equation of the toolbox model is valid. We observe that, for our purpose, 
the time step of equation we can be defined arbitrarily, as genome growth is eventually parameterized by 
n. Rewriting the equations as 

{AfLyyiQi — ^^met (S5) 
AriTF = nmet ^TJ^ , 

gives the summed probabilities relative to the two categories 

^raet ^fmet 



Po 



C{n) (S6) 



Po '■— J2i£TfPo 



C(n) 



while 

PN = . (S7) 

Accordingly, we extend the model to an arbitrary number of families by the choice Uij = "^"^ if 
I is a TF family and j a metabolic family and zero otherwise. This gives 

"met . _ ^ 

^ Z^^eme^jw^^j if J eTi^ 

, n-i — a ^ ' 

Pq — — 7 if i S met . 

This model gives the asymptotic quadratic scaling of utf with rimet by definition, using the exact 
same argument as the toolbox model. Other results have been obtained numerically (see Supplementary 
Figure ISil). 
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SI. 1.2 Model lb. 

This second formulation of a model with correlated recipe (model lb) imposes a different correlation rule. 
For example, consider the model involving only two functional categories, transcription factors controlling 
metabolic processes and metabolic enzymes. 

In this variant the coefficients a.^j have both a diagonal and a non diagonal part, a^j = 6ij + bi j. 
If & = the model is the standard Chinese Restaurant Process. For this reason, model lb is simpler to 
treat analytically, exploiting previous results. This work focuses mainly on the case j = ni/n^-^^t if 
i is a family from the functional category of transcription factors and j is a family from the metabolic 
functional category (and bi,j = otherwise). 

In this case, the summed probabilities Pq relative to the two categories are 



V a 



'E{.j=i o-ijrij + 9 

if i e met. 



j=i ^ " ('gg^ 

rij — a ^ ' 



Using the definitions given in Equation |S1[ one can see that, 

C{n)dnnTF ^ riTF +nTF ~ afrF + C{n)dnfTF ^ 2nTF ~ afrp + afrF + 9xtf = 2nTF + 9xTF , (SIO) 
while 

C{n)dnnmet = rimet - a/met + C{n)dnfrnet = Timet + 9Xmet ■ (Sll) 



Hence, for large n, since dnfc = XcPn — ctfc, the terms in the r.h.s. of Equations (SIO) and (Sll) cancel, 
giving the effective equation, 

^ , (S12) 

and thus the scaling utf ^ ^mei- 

SI. 2 Model II (model with evolutionary potentials) 

This section presents in more detail the uncorrelated version of the model for the joint scaling (model 
II), assigning evolutionary potentials js] pc to the functional categories, related to the probability that 
a gene added in a functional category is fixed by natural selection. This model is an example of an 
"absolute recipe", since each category grows with an intrisic rate Pc, summing up the growth of the 
families belonging to the given category. The rate pc acts on family growth through the class-expansion 
move. The probability of class expansion of a family belonging to the category c is equal to 

Pc(i)Tii — a , , 

Ph - (S13) 

where = Pc if the evolutionary family i belongs to the functional category c. This model assumes 
that the value of Pc(*) depends only on the category to which family i belongs. The probability that a 
domain belonging to category c is added by class expansion is then 



^ec 1^3 = 1 Pc{j)Th 



Equally, the probability that the new domain is introduced by an innovation move (i.e. it belongs to a 
new family) is equal to 

PN = —f -■ (S15) 

T,j=i PcU)nj + 9 
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Under the assumption (confirmed by empirical data, see main text) that the growth of old functional 
categories by adding new homology families through the innovation move is uniform (i.e. that fc = 
Ac + Xcf)7 the probability that a new family belonging to the category c is added by an innovation move 
is 

af + _ afc + 0X c 



Pn XcPN = Xc—f = —7 -■ (S16) 



Evolutionary potentials can reproduce the combined scaling laws at finite sizes. 

We tested this model by a combination of mean-field analytical arguments and direct simulation. 



The mean-field equations are obtained from Equation SI by using Equations S13 and S15 The 
equation for the growth of the mean number of members ric of a functional category can be obtained 
simply by summing on the homology families that belong to a given category, 

PcUc + OXc 

dnfic = — y:^. — , S17) 
C n) 



where C{n) ~ J2iPi''^i- If C'(n) ~ n, equation (S17l corresponds to the evolution equation written by 
Molina and Nimwegen. Simulations of this model (see Supplementary Figure S7) confirm that this is the 
case. Thus, the mean-field argument predicts that this model can reproduce both scaling laws. 

Also note that a rescaling of C(n) is equivalent to a rescaling of a. Indeed, for large n, pm ~ af /C{n) 
(and po = 1 — PN)^ so imposing C{n) ~ qn is equivalent to dividing a by the constant factor q. Thus, 
one can choose 9 = 1 without loss of generality (by a rescaling of all the pc), and the solution for the 
population of a functional category will be ric '-^ nP"/'^ as in the Molina/Nimwegen model, and thus 

Cc = Pc/q 

On the other hand, an important point regarding this model is that, asymptotically for any choice 
over the pc set, the maximum large-n exponent observed will be 1, Indeed, we can use the approximation 
C{n) — J2iPi'^i Pma^n''''/'^ , but C = qn, so that q = Pc„,a^- This means that an exponent close to 2, 
such as that observed for transcription factors can only be obtained in a transient regime of the model. 
Furthermore, the change of the evolutionary potential of one functional category has repercussions on 
the other categories, as it implies a change in the normalization costant C. These facts make a direct 
identification of the value of the evolutionary potential with an intrinsic propery of a single functional 
category difficult. They also make the direct identification of evolutionary potentials less straightforward 
(as it requires an arbitrary rescaling). 

However, the above remarks have little practical importance, and the large-n behaviour of the model 
does not really affect its performance at the relevant values of n. Numerical simulations show that at the 
empirical genome sizes, the scaling behaviour of the model can reproduce rather well the empirical one. 
For simplicity we have restricted to three main categories (transcription factors, metabolic genes and 
"others") and we verified that in practice it is not hard to find a parameter set in good agreement with 



the empirical data on protein domains (Supplementary Figure S3). The general number of parameters 



to adjust increases with the number of functional categories that one needs to consider. 

S2 Exponents of family size distribution histograms 

This section discusses the family size distribution histograms, as obtained from the mean-field approach. 
To fix the ideas, we will focus on model lb, where the mean-field equations can exploit the known 



results from the GRP. It is possible to write a mean-field "flux equation" for the histograms 14 , which 
implements the fact that each duplication adds a family with one extra member to the histogram count 
and subtracts a family with its previous population, 

dnf{d,n) =po{d- l,n)f{d- l,n) - po{d,n)f{d,n) +PNSd,i (S18) 
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where poid,n) — is the probability that a family with d domains add a new duplicated member. 

The term pn — "J^g contains the innovation probability contributing to the growth of the number of 
families with one member. Note that the flow between families can be written as 



ie 



1^ dnn, = id-a)^^. 

r families with^ 

< j domains > 



This equation requires an assumption on f{d,n) in order to be solved. We assume the ansatz f{d,ji) = 
P{d)f{n) which is justified by both simulation and empirical data 14 . Using the fact that dnf{n) — p^, 
combined with Equation |S18| gives the following equation for the probability of a family to have d members 

aP{d) = {d-l-a)P{d-l)-{d~a)P{d) , (S19) 

which can be solved in discrete or continuous d to get 



m ^ (^^ j . (S20) 

This predicts the asymptotic behaviour of data and simulations (see Figure |6]) with /3 = a, where (3 is 
the asymptotic exponent of the family size distribution. 

Let us now turn to the same distribution, restricted to transcription factors. In model lb, the flux 
from transcription factor families caused by family expansion is caused by two separate contribution, the 
CRP standard one, plus additions of transcription factors to an existing family caused by the addition of 
a metabolic enzyme 



^met 



ifieTF (S21) 



I.e. 



PhH = 7^ [2n. - a] , if z e TF. (S22) 

Thus, for the transcription factor families, the probability that a domain is added to a family with d 
members will be 

Pl'^{d,n)^-^ [2d~a] . (S23) 

The quantity pQ^{d,n) is the probability that a new transcription factor domain is added to a family 
with d members. The flux equation for TF families can be obtained by substituting equation |S23| in 
equation S18 (for rf > 1) 

Cin)dnfTFid, n) = [2(d - 1) - a] frrid - 1, n) - [2d - a] frrid, n) (S24) 

This is solved using the usual ansatz fTF{d,n) ~ Pt f (d) fx f (n) (as explained above it is confirmed by 
both data and simulations). Using /tf("-) — XTFf{n), leads to the equation 

aPTF{d) = {2d-2- a)PTF{d) - {2d- a)PTF{d) , (S25) 

which gives: 

P{d)TF ^ ' , (S26) 

that is /? = a/2 = /?/2. In the above calculation we have supposed again that the number of transcription 
factors is small with respect to to the total number of metabolic enzymes. 
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Furthermore, it can be argued that this fact is more general. Indeed, each time the per-family 
duphcation probability for the TF functional category will have the form 

Po - 2?^^ , 

when family i belongs to TF category, the coefficient 2 will appear in the equation for P{d)TF modifying 
the exponent. In particular, this will also be true for models la (generalizing the toolbox model) and II 
(generalizing evolutionary potentials). 

In other words, each time a functional category scales with a given exponent, it can be argued on 
rather general grounds that the exponent of the population histograms of the homology families that 
form it will be affected. It is possible to to generalize this argument, and find a precise relationship 
between the scaling exponent of a category and the family population histogram (restricted to the same 
category). In other words, if Cc is the scaling exponent of the category c and /3c is the exponent of the 



cumulative distribution histogram for the families belonging to category c, that is (see Equation S26 ) 

PM,~(^ 

we suggest that /3c = /3/Cc- We tested this prediction in empirical data plotting 1/(5^ versus Pc in Figure [6] 
(Pearson correlation coefficient 0.47). 



S3 Comparison of models by numerical simulation 
S3.1 Correlated and absolute recipes 

This section compares the correlated duplication and the evolutionary potential model variants. We 
considered a three categories model (TF, Metabolic and "other"). 

The evolutionary potential model needs to supply three parameters pc, while the correlated model 
needs to supply the correlation law between categories (a^). We impose a correlation only between 
transcription factor and metabolic families with the correlated model lb prescription, i.e. 

fly = njumet, (S27) 

where i is a TF family and j Metabolic, = (no correlation) otherwise. 

Figure [S3| summarizes the results of this comparison. The correlated duplication model performs better 
in reproducing the behavior of the transcription factor category (both scaling law and histograms). Both 
models are unsatisfactory in reproducing the family population histogram of the metabolism families. 
This is probably caused by the fact that neither model include a correlation between metabolic families 
(Figure [7|. 

Figure [S7| illustrates the behaviour of the normalization function C{n). C{n) is linear with n in the 
range of empirical genome sizes (although the slope is not exactly 1). It becomes nonlinear at larger sizes, 
and its linear behavior is restored only at very large values of n. 



S3. 2 Model I can reproduce a set of different exponents 

Extending a model (with absolute or correlated recipe) to a large number of categories is not a simple 
task. In the case of an absolute recipe model, adding a new category c' (and thus introducing a new 
evolutionary potential pc') generally requires, in order preserve the scaling of all the categories, a tuning 
of all the evolutionary potentials (both the old ones and the new one). This is due to the fact that all 
the evolutionary potentials appear in the normalization constant C(n) in the growth equation of each 
category (Equation (S13|. In a model with a correlated recipe, the main problem is related to the fact 
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that the interaction laws between categories are not known, they can be complex and possibly include 
feedback. 

In order to produce the proof of principle that a model with correlated recipes can work with more 
than three categories, we considered a trivial generalization of model lb to multiple categories that are 
slaved to a main one, and considered the question of whether this model would be able to reproduce an 
arbitrary set of scaling exponents for the categories. 

We consider a correlation matrix a^.j of the form S^ j + bij, where bij — 0. This model deals with C+1 
categories, the met category (in analogy with model lb defined in the main text, this is a category whose 
growth is not conditioned to the others), and an additional set of C categories labeled from 1 to C The 
non diagonal correlation coefficients hi, j are zero if family i belongs to the Ttiet category, and ^c{i)^i /^met 
if family i belongs to category c, different from met, and j belongs to the met category. Substituting this 
choice in equation [S9) gives 

;^ = (l + 7c)— (S28) 

and thus 

nc ^ nl+r . (S29) 



Supplementary Figure S8 shows simulations from a model with 10 + 1 categories. The model is able to 
reproduce an arbitrary set of exponents. We observe that this version has similar problems as the model 
with evolutionary potentials, as, in absence of a biological underlying model, it needs the tuning of a 
set of parameters to reproduce the scaling laws. The fitted exponent is typically different from 1 + jc, 
specifically it seems to be closer to one. We interpret this as a finite size effect, due to the fact that the 
contribution of innovation to the scaling exponents is relevant. 



S4 Details of TF-domain superfamily scaling 

We observe that the quadratic (or very nearly so) scaling for transcription factors is clearly visible at in 
the two most populated families of transcription factor DNA-binding domains (Homeodomain-like and 



Winged- helix) , which have a rather clean slope (see Supplementary Figure SIO). In fact, three families 
present a clearly observable scaling alone (Homeodomain-like, Winged-helix and C-terminal), but just 
the first two follow a very nearly quadratic scaling. 

Note however that removing the six most populated TF families, representing 80% of the total TF- 
domain population, the remaining ones considered together still present a scaling when added up, but 



with exponent ~ 0.9 (see Supplementary Figure Sill. This indicates that the collective scaling of TF 



families cannot be entirely recunducted to properties of the most populated ones, but these are the 
families responsible for the quadratic scaling. 

Thus, the "pure" quadratic scaling is observable in the largest transcription factor families. Collecting 



all the families, wemeasure a lower exponent in empirical data (close to 1.6). Supplementary Figure Sll 
explains this behavior, showing the total contribution of the smaller transcription factor families. These 
families collectively show a lower exponent (close to 1). Thus, we can interpret the lower collective 
exponent as an effect of family size (i.e., in the language of statistical mechanics, a "finite-size" effect) 
connected to the fact that for smaller family size, the innovation move is more relevant and thus the 
family expansion process is slower. The same effect is present in our simulations (see Supplementary 
Figure [sB]) 
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400 




Total Number of Families 

Supplementary Figure SI. Scaling of the number of families in the three main functional 
categories. Linear scaling behaviour of the number of families in three important functional categories 
versus total number of families from empirical data (for 753 bacteria in the SUPERFAMILY database). 
The slopes for the three linear laws are 0.01 (Translation), 0.03 (Regulation of Transcription) and 0.47 
(Metabolic Processes). 
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Supplementary Figure S2. Transcription factor families. Boxplot of the number of 
transcription factor domain families versus total number of domain families (data from 753 
SUPERFAMILY bacteria). There appears to be a roughly linear scaling. This means that the number 
of TF domain families is compatible with a null hypothesis of independent addition model. 
Charoensawan et al \22^ propose that the number of TF families follows a linear scaling with genome 
size. If this were to be the case, the innovation dynamics of transcription factor families should be 
distinct form other families. In fact, if frpin) n, since the total number of families is sublinear, 
f{n) ~ n" in the CRP (Figure[l]), then one would have /tf /^~", which is not confirmed by the 
SUPERFAMILY data analyzed here. 




Supplementary Figure S3. Comparison between models lb and II. Comparison between 
simulation of the correlated duplication model lb (left panel) and evolutionary potentials (right panel) 
model variants with empirical data. Simulations are run at a = 0.3 and 9 = 140. (a) Number of TFs 
domains vs. number of metabolic domains (the blue boxplot corresponds to simulations, red circles to 
empirical data), (b) Number of metabolic domains vs. total number of domains (the blue boxplot 
corresponds to simulations, red circles to empirical data), (c) Family population histograms restricted 
to the transcription factor functional category (black circles are simulations, magenta lines empirical 
data), (d) Family population histograms restricted to the metabolism functional category (black circles 
are simulations, magenta lines empirical data). 
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Supplementary Figure S4. Simulations of the correlated duplication model la for two 
categories (transcription factors and metabolic enzymes). The plots are obtained from 1000 
realizations with a ~ 0.3, 9 ~ 140 and U — 7000. The observables are the same as in figure S3 (a) 
scaling of the number of transcription factors with the number of metabolic enzymes, (b) Number of 
families as a function of genome size n. (c) Family population (cumulative) histograms, (c) Family 
population histograms restricted to the families belonging to the transcription factor functional 
category. 
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Supplementary Figure S5. Linear relation between the number of domains and the 
number of genes, (a) Number of Domains vs. number of protein-coding genes for the 753 bacteria in 
the SUPERFAMILY database. There are, on average, 1.45 domains per gene, (b) Linear scaling 
behaviour of the number of TF domains vs. number of TF genes. There are, on average, 1.09 TF 
domains in a TF gene. 




Supplementary Figure S6. Simulation of the number of transcription factor families. 

Comparison between empirical data and simulations of the number of transcription factor domain 
families plotted against total number of families. The scaling is empirically linear, i.e. the number of 
TF domain families is reproduced by a null hypothesis of independent addition model. The choice of 
the parameter is 0.035. 
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Supplementary Figure S7. Normalization constant inthe model with evolutionary 
potentials (model II). Behavior of the ratio C{n)/n, where C{n) is the normahzation factor for the 
evolutionary potential model. Data from simulations with three categories run at parameters a — 0.3 
and 9 = 140. C(n) is linear with n in the range of empirical genome sizes, it then looses linearity, to 
become linear only asymptotically. 
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Supplementary Figure S8. Simulation of model lb with 10 + 1 categories. 10 categories are 
slaved to one master category with different correlation laws, which determine the observed exponents). 
Panel A, B and C show the simulations of the population of three categories (respectively with jc equal 
to 1, and —0.7). The red lines are power-law fits of the simulated data. Panel D shows the power-law 
fits of the simulated data for all ten categories. 
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Supplementary Figure S9. Correlation matrix for two sets of genomes with different 

sizes. Left panel: Correlation matrix for genomes with size < 4000. Right panel: Correlation matrix 
for genome with size > 4000. The correlations do not depend on size. 
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Supplementary Figure SIO. Most populated transcription factor superfamilies. Boxplots 
for the population of the six most populated superfamilies of TF DNA-binding domains (y-axis in each 
panel) versus number of domains of each genome (x-axis in each panel) . The presence of scaling laws 
appears likely for the three most populated families and arguable for the first five. Red lines represent 
best power law fit (1.8 for Winged Helix ,2.1 for Homeodomain-like and 1.7 for C-terminal effector) 
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Supplementary Figure Sll. Scaling of the least populated transcription factor 
superfamilies. Collective scaling of the number of transcription factor domains after removing the six 
globally most populated families. While a few genomes show large fluctuations from the typical trend, a 
clear scaling is still observable for most genomes, with a fitted exponent equal to 0.9 
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Supplementary Figure S12. MARCO Finite-size effects on the scaling exponent <^tf for 
transcription factors in simulations of model lb. The plot shows the fitted exponent (y-axis) 
from the curve of the number of transcription factor domains versus the number of metabolic enzymes 
in 500 simulated realizations of model lb with parameter a = 0.3 and = 140. Each point on the x-axis 

corresponds to simulated data stopped at a given size n. The mean- field prediction (Ctf ~ 2) is reached 
only in the limit n — oo. This plot shows that the fitted exponent 1.6 (instead of 2) for the growth of 
transcription factors vs metabolic domains is due to a finite-size effect of a process that produces an 
exponent 2 in the large-n limit. The same effect is present in models la and II. 
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Supplementary Figure S13. Ratio between exponents of family population histograms. 

The plot reports the ratio j3/ Ptf between the exponent of the total family population histograms and 
the histograms restricted to the transcription factor families (see Figure [s] in the main text), as a 
function of genome size. The values of the ratio are distributed around 1.6 and the fluctuation range 
decreases with increasing genome size. 
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Supplementary Table SI. Fitted values of Xc and offsets from vs / for the ten largest 
functional categories 





Ac 


Xc 


Reduced chi square 


Transcription Factors 


2.2 ±0.4 


0.0267 ±0.0006 


4.5 


Translation 


61.0 ±0.35 


0.0133 ±0.0006 


3.9 


Small molecule binding 


3.0 ±0.2 


0.01 ±0.0002 


0.9 


Nucleotide transport and metabolism 


5.6 ±0.3 


0.02 ± 0.0005 


3.1 


DNA replication/repair 


9.5 ±0.6 


0.0437 ± 0.0009 


9.8 


Inorganic ion transport and metabolism 


0.2 ±0.4 


0.0272 ± 0.0005 


3.5 


Redox 


-7.6 ±0.5 


0.0592 ± 0.0008 


7.9 


Transferases 


5.3 ±0.2 


0.0213 ±0.0004 


1.6 


Other enzymes 


-14.8 ± 1.1 


0.155 ±0.002 


35.7 


Signal transduction 


-3.2 ±0.3 


0.0282 ± 0.0005 


3.3 



The number of evolutionary families belonging to a functional category follows a linear law in empirical 
data. The table reports fits of fc ~ Ac + Xcf from the plots in Figure [2] of the main text, where fc 
represents the number of families in category c on all genomes and / is the total number of families on 
the genome. The third column is the reduced chi square. 



Supplementary Table S2. Data of fitted exponents from Figure [6| of the main text, for the 
ten largest functional categories 





Cc 


/3c 


Transcription Factors 


1.6 ±0.02 


0.47 ±0.01 


Translation 


0.176 ±0.003 


1.46 ±0.02 


Small molecule binding 


0.918 ±0.006 


0.25 ±0.01 


Nucleotide transport and metabolism 


0.61 ±0.01 


0.71 ±0.01 


DNA replication/repair 


0.54 ±0.01 


0.9 ±0.01 


Inorganic ion transport and metabolism 


1.40 ±0.02 


0.46 ±0.01 


Redox 


1.3 ±0.01 


0.52 ±0.02 


Transferases 


1.09 ±0.01 


0.43 ±0.01 


Other enzymes 


1.09 ±0.01 


0.64 ±0.01 


Signal transduction 


1.77 ±0.03 


0.4 ±0.01 
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Supplementary Table S3. Correlation coefficients between the populations of metabolic 
functional categories 





En 


0- 


Ph 


Aa 


N 


Co 


Nn 


Ca 


Li 


Ps 


Cc 


2M 


Rx 


Tr 


Ot 


En 


1 


0.14 


0.07 


0.55 


0.23 


0.36 


0.19 


-0.06 


-0.08 


-0.14 


0.02 


0.22 


0.31 


-0.10 


-0.004 


0- 


0.14 


1 


0.29 


0.15 


0.11 


0.43 


-0.09 


-0..52 


0.35 


-0.29 


0.13 


0.19 


0.47 


0.09 


0.05 


Ph 


0.07 


0.29 


1 


0.12 


0.21 


-0.02 


-0.09 


-0.16 


-0.21 


0.14 


-0.18 


0.15 


0.06 


0.16 


-0.05 


Aa 


0..55 


0.15 


0.12 


1 


0.08 


0.39 


0.19 


-0.14 


-0.07 


-0.22 


0.01 


0.07 


0.40 


0.02 


0.14 


N 


0.23 


0.11 


0.21 


0.08 


1 


-0.13 


-0.08 


-0.003 


-0.14 


-0.09 


0.09 


0.26 


0.04 


-0.03 


-0.02 


Co 


0.36 


0.43 


-0.02 


0.39 


-0.13 


1 


0.14 


-0.33 


0.44 


-0.37 


-0.04 


0.08 


0.51 


0.12 


0.16 


Nu 


0.19 


-0.09 


-0.09 


0.19 


-0.08 


0.14 


1 


-0.03 


-0.09 


-0.10 


-0.02 


-0.10 


0.03 


-0.11 


-0.13 


Ca 


-0.06 


-0.52 


-0.16 


-0.14 


-0.003 


-0.33 


-0.03 


1 


-0.20 


0.53 


-0.18 


0.02 


-0.46 


-0.11 


0.16 


Li 


-0.08 


0.35 


-0.21 


-0.07 


-0.14 


0.44 


-0.09 


-0.20 


1 


-0.35 


0.15 


0.18 


0.06 


0.13 


0.20 


Ps 


-0.14 


-0.29 


0.14 


-0.22 


-0.09 


-0.37 


-0.10 


0.53 


-0.35 


1 


-0.12 


-0.05 


-0.36 


0.09 


-0.07 


Cc 


0.02 


0.13 


-0.18 


0.01 


0.09 


-0.04 


-0.02 


-0.18 


0.15 


-0.12 


1 
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0.01 
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-0.31 


2M 


0,22 


O.l'l 
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0,2(i 


0,0,*^ 


-0,10 
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Hx 


0.31 


0.47 


U.lKi 
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U.U3 
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0.12 


-0.11 


-0.11 


0.13 


0.09 


-0.22 


0.20 


-0.10 


1 


0.17 


Ot 


-0.004 


0.05 


-0.05 


0.14 


-0.02 


0.16 


-0.13 


0.16 


0.20 


-0.07 


-0.31 


0.08 


0.14 


0.17 


1 



Pearson's correlation coefficients between the populations of 24 different metabolic functional categories 

from the SUPERFAMILY database for 753 bacteria. Correlations arc calculated from fluctuations of 
categories from the average trend (see Methods). Both correlation and anticorrelation are present 
between categories. Metabolism categories are highly (anti-)correlated. We used the following short 
forms for the metabolic functional categories: En = Energy p/c, e- = Electrons transfer, Ph = 
Photosynthesis, Aa = Amino acids m/tr, N = Nitrogen m/tr, Co = Coenzyme m/tr, Nu = Nucleotide 
m/tr, Ca = Carbohydrate m/tr, Li = Lipid m/tr, Ps = Polysaccharide m/tr, Ce = Cell envelope m/tr, 
2M = Secondary metabolism, Rx = Redox, Tr — Transferases, Ot = Other enzymes. Where m/tr 
stands for "metabolism and trasportation" and p/c means "production and conversion". 



Supplementary Table S4. P- Values of correlation coefficients betvireen the populations of 
metabolic functional categories 
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P-valucs of the Pearson's correlation coefficients between the populations of 24 different metabolic 
functional categories from the SUPERFAMILY database for 753 bacteria (the most significant values 
are in boldface). Correlations arc calciilatcid from fluctuations of categories from the average trend (see 
Methods). The (anti-)corrclation is statistically signiflcant for the most of the metabolic categories. We 
used the following short forms for the metabolic functional categories: En = Energy p/c, e- = Electrons 
transfer, Ph = Photosynthesis, Aa = Amino acids m/tr, N = Nitrogen m/tr, Co = Coenzyme m/tr, Nu 
= Nucleotide m/tr, Ca — Carbohydrate m/tr, Li — Lipid m/tr, Ps = Polysaccharide m/tr, Ce — Cell 
envelope m/tr. 2M = Secondary metabolism, Rx = Redox, Tr = Transferases, Ot = Other enzymes. 
Where m/tr stands for "metabolism and trasportation" and p/c means "production and conversion". 



